This section allows the user to set parameters that are specific to the dataset to be analysed (e.g., file path to count matrix, group specifications, desired pairwise group comparisons, etc.). This is the only section of the script that requires active user input. If the script produces errors down the line, it is most likely due to incorrect parameter specifications!
## specify working directory
wd = getwd()
## specify file path to data input (tab-separated count matrix + some extra columns denoting gene ids and gene names etc.)
filepath = "./DemoDataset_countMatrix.tsv"
## specify unique gene id column name
unique_id = "gene_id"
## specify gene name column
gene_name = "gene_name"
## specify count column regular expression pattern
count_column_pattern = "_S[0-9]*$"
## specify group names
groups = rep(c("KO_effector","WT_naive", "WT_effector", "KO_naive"), each=3)
## specify sample names in the order they are written in the count table
samplenames = c("KO_effector_10", "KO_effector_11", "KO_effector_12",
"WT_naive_1", "WT_naive_2", "WT_naive_3",
"WT_effector_4", "WT_effector_5", "WT_effector_6",
"KO_naive_7", "KO_naive_8", "KO_naive_9")
## specify batches as factor (if present. Else: set to NULL)
batch = NULL
## specify reordering indices as numerical vector for clustering sample visualization (if wanted. Else: set to NULL)
reorder = c(4,5,6,10,11,12,7,8,9,1,2,3)
## specify minimum number read counts per gene across all samples. Genes with total count below the threshold will be filtered out.
min_counts_threshold = 50
## define number of k-means clusters
k_clusters = 7
## specify if log2 counts should be standardized within each gene for the clustering (If FALSE, they will only centered)
standardize_before_clustering = TRUE
## specify if genes should be additionally filtered for overall significance (H0: no difference between groups) before clustering. Recommended if standardize_before_clustering = TRUE
ANOVA_filter_before_clustering = TRUE
## specify at which p-value cutoff the ANOVA_filter_for_clustering should be applied
adj.pval_cutoff = 0.05
## specify if enrichment analysis (ORA, using gProfiler) on k-means clusters should be performed
enrichment_analysis = TRUE
## specify if enrichment analysis (ORA, using gProfiler) should be performed against a custom background (i.e. all the genes quantified in the experiment). Else, enrichment is tested against the whole annotated genome of the specified organism. Note that this can produce different results!
custom_background = TRUE
## define organism for enrichment analysis (note: "mmusculus" for mouse, "hsapiens" for human, etc)
organism = "mmusculus"
## define adj. pval-threshold for enrichment testing via g:Profiler
pval_threshold_enrichment = 0.05
## define the database sources from which annotated gene sets are tested for enrichment
sources_enrichment = c("GO", "KEGG", "REAC", "WP", "CORUM")
## specifiy the column that indicates whether a geine is protein coding or not
biotype_colname <- "gene_biotype"
## specify the name that, in the above column, indicates a protein coding gene. Only protein-coding genes will be taken into account for enrichment testing
proteincoding_entry <- "protein_coding"
## pairwise comparison
pairwise_comp = list(c("KO_naive", "WT_naive"),c("WT_effector","WT_naive"))
## genes of special interest
genes_of_special_interest = c("Il2", "Il4", "Bcl6", "Gata3", "Rin1", "Rin3", "Rin2", "Cxcr5", "Ifng", "Stat4", "Eif5b", "Foxo1", "Sox12", "Ncor2","Jun")
## specify whether output matrix should be exported as txt-file
export_matrix = TRUE
## Overview of samples:
## column_names samplenames groups
## 1 S_10_KO_effector_CD4_plus_S58061 KO_effector_10 KO_effector
## 2 S_11_KO_effector_CD4_plus_S58060 KO_effector_11 KO_effector
## 3 S_12_KO_effector_CD4_plus_S58059 KO_effector_12 KO_effector
## 4 S_1_WT_naive_CD4_plus_S58056 WT_naive_1 WT_naive
## 5 S_2_WT_naive_CD4_plus_S58057 WT_naive_2 WT_naive
## 6 S_3_WT_naive_CD4_plus_S58058 WT_naive_3 WT_naive
## 7 S_4_WT_effector_CD4_plus_S58052 WT_effector_4 WT_effector
## 8 S_5_WT_effector_CD4_plus_S58053 WT_effector_5 WT_effector
## 9 S_6_WT_effector_CD4_plus_S58054 WT_effector_6 WT_effector
## 10 S_7_KO_naive_CD4_plus_S58055 KO_naive_7 KO_naive
## 11 S_8_KO_naive_CD4_plus_S58063 KO_naive_8 KO_naive
## 12 S_9_KO_naive_CD4_plus_S58062 KO_naive_9 KO_naive
## number of genes before filtering for at least 50 counts per gene across all samples: 28299
## number of genes after filtering for at least 50 counts per gene across all samples: 16839
Data normalization is performed using DESeq’s estimateSizeFactors function which estimates sample-wise scaling factors. First the unnormalized input data is visualized:
## plotting barplot and boxplots for unnormalized data:
Next, the normalized data (to be used in all subsequent analysis steps) is visualized.
## plotting barplot and boxplots for normalized data:
## plotting pairwise scatterplots for normalized data:
If normalized properly, we expect most genes to fall around the dashed
line. Note: If there are more than 7 samples in the data, this plot just
picks the first 7 samples.
Note that the data, at this point, is not batch corrected. Potential batch-effects would therefore be visible here. Note that this plot is interactive in the html output! Hover over points to retrieve information.
## plotting PCA:
## plotting global heatmap of normalized log2-transformed counts:
## plotting global heatmap of centered normalized log2 counts (shifted to equal row mean of 0) :
## plotting global heatmap with standardized log2 counts:
Each of these three heatmaps can highlight a different aspect of the data.
Here, a simple one-way ANOVA tests for overall differences between groups in all gene-wise transcript abundance profiles using log2-transformed gene counts (H0: All groups are equal; H1: At least one group differs from the others). If a batch effect is specified, the ANOVA automatically switches to a 2-way ANOVA by incorporating the batch variable as a covariate into the model. Note that this section is entirely skipped if there are no more than 2 groups overall; or if the minimum number of replicates per group is smaller than 3.
## plotting ANOVA p-value histogram:
The bar on the very left reflects p-values < 0.05. Note that per definition, p-values are uniformly distributed if the null hypothesis is true (this applies to any statistical test if the required assumptions for the test are also met - that is why looking at p-value histograms can be quite informative!).
In this section, k-means clustering is performed on log2-transformed gene-wise transcript abundance profiles. Depending on how the parameter “standardize_before_clustering” is specified, log2-transformed gene-wise counts will further either be centered (if set to FALSE), or standardized (if set to TRUE).
Centered implies that log2 counts of each gene are shifted to reach a mean of 0; i.e. a gene-wise subtraction of the mean log2 count is performed on the normalized, log2-transformed data, in order to pick up common abundance signatures of gene transcripts. Importantly, differences in the y-axis range still represent observed log2 fold change differences this way.
Standardized implies that centered log2 counts will further be normalized based on their variance (thus brought to unit variance). This is identical to what is also referred to as Z-scoring. A drawback of this transformation is that any subsequent clustering will also pick up common noise patterns of non-DE genes (e.g. batch effects, etc.), since variances of non-DE genes will be equally pushed to 1, thereby amplifying “noise”. I therefore recommend first filtering for genes that globally differ between groups (in at least 1 group) via ANOVA-derived p-values (this can be specified in the parameter section) when choosing this option. For this analysis, the chosen parameters are:
print(k_clusters)
## [1] 7
print(standardize_before_clustering)
## [1] TRUE
print(ANOVA_filter_before_clustering)
## [1] TRUE
## 10245 genes out of 16839 (60.8%) passed this filter.
The following plot (sometimes called “elbow-plot”) can help to ascertain a reasonable/suitable total number of clusters:
## empirically testing for optimal number of clusters:
For each k-means cluster, the plots below feature: a) gene-wise transcript abundance profiles as dashed black lines. Further, profiles of genes specified as “genes of special interest” (see parameter section) are highlighted individually in color. b) the coordinates of the respective k-means cluster center as colored points (colored according to group identity).
## plotting individual clusters + gene transcript abundance profiles:
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## plotting number of genes per cluster:
In this section, using g:Profiler, gene set functional enrichment analysis via ORA (overrepresentation analysis) is performed on the protein-coding genes of each cluster. If the parameter custom_background is set to TRUE, all quantified protein-coding genes within the experiment are defined as a custom background list (more specifically, all protein-coding genes that passed the initial filtering step based on minimum total read counts). Else, if this parameter is set to FALSE, the enrichment analysis takes the entire annotated genome of the specified organism as a background. g:Profiler then tests for overrepresentation of specific annotated gene sets via Fisher’s Exact Test. These gene sets come from several public databases like GO (Gene Ontology), KEGG (Kyoto Encyclopedia of Genes and Genomes), REAC (Reactome), WP (WikiPathways), HP (Human phenotype ontology), CORUM (CORUM, database of mammalian protein complexes).
The enrichment analysis returns no visual plot output for a cluster when it contains more than 1/3 of the total number of quantified protein-coding genes in the experiment (this setting is intentional in order to limit processing time - the bigger the query list, the more time the calculation takes; but also because the interpretation of an “enrichment” is rather unclear in such a case anyway in my opinion), or when a cluster’s gene list does not produce any statistically significant hits. Moreover, enrichment terms with a term size over 5000 are not displayed (as they are often too generic to be useful), and neither are enrichment terms that produce an intersection set consisting of only a single gene.
The visual output is reminiscent of Manhatten-plots (which is created by an in-build function of the g:Profiler software): The x-axis corresponds to functional terms which are grouped and color-coded according to the different database sources they originate from. The y-axis displays the significance of gene sets tested for enrichment. Each point thus corresponds to a single gene set that was found enriched in the respective cluster, i.e. surpassing a statistical significance threshold as defined in the parameter section. The point size corresponds to the number of genes in the respective gene set.
Note that the individual enrichment testing results are also documented in a separate folder called “enrichment”! Additionally, the file “EnrichmentAnalysis_Userguide.pdf” explains how the data can be interpreted/understood.
The parameters for this analysis were chosen as:
print(enrichment_analysis)
## [1] TRUE
print(custom_background)
## [1] TRUE
print(organism)
## [1] "mmusculus"
print(pval_threshold_enrichment)
## [1] 0.05
print(sources_enrichment)
## [1] "GO" "KEGG" "REAC" "WP" "CORUM"
The localization of terms on the x-axis is not random: Similar, hierarchically connected terms are in close proximity to each other. Also, detailed information on all statistically enriched gene sets is saved separately as tables in the “enrichment” folder. This also includes lists of individual gene names that caused a specific term to appear enriched (see column named: “intersection”) in the respective cluster (i.e. the query gene list). This table is very useful, as it allows filtering for term size (smaller term size -> less generic, i.e. more specific biological pathway/processn information). Finally, heatmaps of transcript abundance profiles for all genes within each cluster are saved separately in the folder named “figures”.
Pairwise differential expression (DE) testing is performed using the R package DESeq2. This package models RNA read counts as negative binomially distributed count data within a generalized linear model setting. Batch effects will be included as fixed effects into the model (i.e. as separate regressor variables). This section returns p-value histograms of pairwise group comparisons, as well as corresponding volcano plots that highlight genes of special interest (as specified in the parameter section). Note that for the volcano plots, fold change shrinkage for low-abundance genes has been performed.
## Experimental design:
## Conducting the following pairwise group comparison
## [1] "KO_naive" "WT_naive"
## Results:
##
## out of 16839 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 163, 0.97%
## LFC < 0 (down) : 87, 0.52%
## outliers [1] : 0, 0%
## low counts [2] : 2612, 16%
## (mean count < 15)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
##
##
##
## Histogram of p-values and adjusted p-values:
## Volcano plot highlighting genes of special interest:
## plotting heatmap of top upregulated genes (log2-transformed centered normalized counts)
## plotting heatmap of top downregulated genes (log2-transformed centered normalized counts)
## Conducting the following pairwise group comparison
## [1] "WT_effector" "WT_naive"
## Results:
##
## out of 16839 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 4611, 27%
## LFC < 0 (down) : 3773, 22%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
##
##
##
## Histogram of p-values and adjusted p-values:
## Volcano plot highlighting genes of special interest:
## plotting heatmap of top upregulated genes (log2-transformed centered normalized counts)
## plotting heatmap of top downregulated genes (log2-transformed centered normalized counts)
In this section, the same volcano plots from above are plotted again but as interactive plots in the html output (hover over points to retrieve gene name information!) while also indicating the significant level of each gene. Adjusted p-values are capped at 10^(-100).
Here, the filtered and normalized data is exported, now also containing the DE testing results as additional columns + k-means cluster information.
## Generated output txt-file file called:
## Matrix_Export_2023-06-06.txt
## (16839 rows, 50 columns)